ABSTRACT

Objectives: The aim of this study was to investigate 
potential statistical relationships between black tea 
consumption and key health indicators in the world. 
The research question is: Is tea consumption
correlated with one or more epidemiological
indicators?

Design: Ecological study using a systematic data- 
mining approach in which the unit of the analysis is a 
population of one country. 

Setting: Six variables, black tea consumption data and 
prevalence data of respiratory diseases, infectious 
diseases, cancer, cardiovascular diseases and diabetes, 
have been studied at a global level. 
Participants: Data from 50 participating countries in 
the World Health Survey were investigated. 
Primary and secondary outcomes measures: 
Level of statistical relationships between variables. 
Results: Principal component analysis established a 
very high contribution of the black tea consumption 
parameter on the third axis (81%). The correlation 
circle confirmed that the 'black tea' vector was 
negatively correlated with the diabetes vector and was 
not correlated with any of the other four health 
indicators. A linear correlation model then confirmed a 
significant statistical correlation between high black tea 
consumption and low diabetes prevalence. 
Conclusions: This innovative study establishes a 
linear statistical correlation between high black tea 
consumption and low diabetes prevalence in the world. 
These results are consistent with biological and 
physiological studies conducted on the effect of black 
tea on diabetes and confirm the results of a previous 
ecological study in Europe. Further epidemiological 
research and randomised studies are necessary to 
investigate the causality. 
BACKGROUND

Various study designs have been used to 
assess the potential benefits of tea. As tea is 
the most widely used ancient hot beverage in 
the world, the simple act of putting tea 
leaves into hot water has provided ancient 
societies with a tasty beverage associated with 
the observation of certain medicinal benefits. 



ARTICLE SUMMARY 



Article focus 

■ This study investigates potential statistical relation- 
ships between black tea (BT) consumption and a 
selection of key health indicators in 50 countries. 

Key messages 

■ A significant linear correlation was established 
between high BT consumption and low diabetes 
prevalence. 

■ These results are consistent with biological, 
physiological and ecological studies conducted on 
the potential effect of BT on diabetes and obesity. 

■ These results should support further causality 
research regarding the health benefits of BT con- 
sumption on type 2 diabetes prevalence in the 
world. 

Strengths and limitations of this study 

■ These original study results are consistent with pre- 
vious biological, physiological and ecological 
studies conducted on the potential effect of BT on 
diabetes and obesity. We believe that this multidi- 
mensional approach provides valuable additional 
scientific information at the global level, as our find- 
ings, establishing a strong correlation between a 
high BT consumption and low diabetes prevalence, 
can be considered to provide a contribution to exist- 
ing studies conducted on tea consumption, diabetes 
and obesity. 

■ Diabetes prevalence data were obtained from the
World Health Survey implemented by the WHO,
which constitutes an official source of key morbidity
indicators around the world. However, the quality of
data collection can be expected to be heterogeneous
around the world and diabetes diagnostic criteria
can vary from country to country.

■ Another important concern is the interpretation of
the established statistical relationship between BT
consumption and diabetes prevalence. The
numerous factors contributing to the growth of
diabetes and obesity throughout the world
confirm that 'correlation does not imply causality'
and that a significant linear correlation between
BT consumption and diabetes prevalence does
not imply that low-BT consumption could cause
diabetes. A correlation can only indicate a potential
direct or indirect cause, which then needs to
be further investigated.

■ A frequent criticism of the use of data mining is based on the
confusion between data mining and data dredging techniques. 
While a data mining approach is based on searching for combi- 
nations of variables that might show potential correlations, data 
dredging can generate misleading results. When a number of 
hypotheses are tested, it is expected that some will falsely 
appear to be statistically significant, since every database can 
contain potential random correlations. A robust data mining 
approach must therefore always be based on a clear research 
strategy and a limited number of relevant meaningful 
assumptions. 

■ A classical criticism of this approach is the 'ecological fallacy', 
corresponding to a logical fallacy in interpretation of the 
observed correlations at the population level, assuming that they 
can be applied at the individual level. Our study on BT does not 
comprise any potential logical fallacy, as it was not used as the 
basis for any individual assumptions. 
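
The point about multiple hypotheses can be made concrete with a short calculation. The sketch below assumes independent tests, each run at the conventional 5% threshold (both assumptions are ours, not stated in the study), and uses the Bonferroni correction purely as an illustration; the study does not report applying it.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold keeping the family-wise error rate at or below alpha."""
    return alpha / n_tests


N_INDICATORS = 5  # five health indicators are compared with BT consumption

fwer = familywise_error_rate(N_INDICATORS)
threshold = bonferroni_threshold(N_INDICATORS)

print(f"Chance of at least one spurious result: {fwer:.3f}")  # 0.226
print(f"Bonferroni per-test threshold: {threshold:.3f}")      # 0.010
```

Under these assumptions, limiting the analysis to five indicators keeps the risk of a chance finding moderate, and a p value of 0.003, as reported in the Results, would remain below the corrected threshold of 0.01.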



Two principal varieties of the species are used: the
small-leaved Chinese variety (Camellia sinensis sinensis), also
used for green tea and white tea, and the large-leaved
Assamese variety (Camellia sinensis assamica), which has
been traditionally used only for black tea (BT). Ancient
Chinese civilisations realised that by using a special fer- 
mentation process, tea leaves would become darker 
allowing them to be stored for longer periods of time. 
During this fermentation process, in which green tea 
oxidises to form BT, caffeine tends to remain constant, 
while the types of flavonoids present in the tea differ. 
Green tea contains simple flavonoids called catechins, 
while BT contains complex flavonoids called theaflavins 
and thearubigins, which could be the chemical entities 
responsible for a number of potential health benefits. 
These tea types were called BT because of the change in 
colour of the leaves as a result of this fermentation 
process. Most recent studies use multidisciplinary 
approaches including epidemiology, field studies and 
laboratory research in animal models, mostly for respira- 
tory diseases, infectious diseases, heart diseases, various 
types of cancers and diabetes, as well as in vitro experi- 
ments. 1-9 In respiratory diseases, several tea components 
have been established to be effective in airway diseases. 
Tea catechin polyphenols seem able to
improve inflammation of obliterative airway disease, 10
protect against oxidative damage and apoptosis in 
human bronchial epithelial cells induced by tobacco or 
attenuate oxidative responses to intermittent hypoxia 
(Burckardt, 2008). In infectious diseases, herbal pro- 
ducts have gained considerable interest among pharma- 
ceutical companies and consumers due to the minimal 
perceived side effects associated with these products. 
Several antimicrobial activities have been attributed to 
tea flavonoids. Catechins appear to have virucidal and 
virustatic actions 11 and appear to exert a protective activ- 
ity against Vibrio cholerae. 12 However, research into the 
potential beneficial effects of tea appears to be most 
active in the field of cardiovascular diseases, in view of 
the number of publications in this field. Most of these



publications tend to confirm that tea catechins would 
exert cardioprotective effects via various mechanisms 
including reversal of endothelial dysfunctions, reduction 
of inflammatory biomarkers, and antioxidant, antiplate- 
let and antiproliferative effects. 13 Moreover, dietary con- 
sumption of tea catechins would have beneficial effects 
on blood pressure and lipid parameters. 14 Similarly, a 
number of studies have focused on the potential effects 
of tea in cancer. Biochemical and biological studies, pro- 
spective cohort studies and double-blind randomised 
clinical prevention trials tend to show convergent results 
for the beneficial preventive effects of tea components 
in various cancers such as hepatocellular carcinoma, 
skin, prostate, lung or colorectal cancer. 15 Antidiabetes 
properties of BT have been suggested by several very spe- 
cific studies, such as a change in pancreatic function in 
streptozotocin-induced glucose-intolerant rats, 16 17 but
also in some human studies investigating other hot bev- 
erages. 18-21 The relatively recent interest in BT may be 
explained by the fact that BT is historically the type of 
tea most widely consumed in Western countries, prob- 
ably due to its good storage properties, promoting active 
trade with tea-producing countries in Asia. Although 
there has recently been a renewed interest in green tea 
in industrialised countries, BT represents over 90% of all 
tea sold in the West. Despite the number of publications 
investigating the effects of tea components and green 
tea in particular, large-scale clinical dose-effect studies 
are still lacking and it is difficult to interpret the clinical 
significance of results derived from some biological 
studies. Considerably fewer studies have been conducted 
specifically on BT, mostly investigating its antioxidant 
properties, 22 23 and cardiovascular effects. 24 25 It could 
therefore be interesting to investigate some key dietary 
habits in relation to lifestyle and health effects at a global 
level, in view of the perceived positive health effects 
of BT, which have been described for centuries. 26 27 
Because of the complexity of implementing international 
prospective studies and the difficulty of conducting 
meta-analyses on a large number of heterogeneous local 
studies, potential correlations between BT consumption 
and epidemiological data around the world could be 
investigated by using advanced exploratory statistical 
methods. The objective of this original research was to 
investigate potential statistical relationships between BT 
consumption and the following five key health indicators: 
respiratory diseases, infectious diseases, cancer, cardiovas- 
cular diseases and diabetes. 



MATERIAL AND METHOD 
Data sources 

BT consumption data were derived from a specific inter- 
national trade survey compiling sales data conducted in 
2009 by Euromonitor International, an independent 
agency specialised in market research. 28 Consumption 
data are derived from BT international trading registries, 
used by BT importers to adapt international orders to 
local sales. Yearly consumption data expressed in kilograms
per capita were available for the following 50
countries: Argentina, Australia, Austria, Belgium, Brazil, 
Bulgaria, Canada, Chile, China, Colombia, Czech 
Republic, Denmark, Egypt, Finland, France, Germany, 
Greece, Hungary, India, Indonesia, Ireland, Israel, Italy, 
Japan, Malaysia, Mexico, Morocco, the Netherlands, 
New Zealand, Norway, Philippines, Poland, Portugal, 
Romania, Russia, Saudi Arabia, Singapore, Slovakia, 



South Africa, South Korea, Spain, Sweden, Switzerland, 
Thailand, Turkey, Ukraine, UK, the USA, Venezuela and 
Vietnam (figure 1). Highest BT consumptions (kg/year 
per inhabitant) are observed in Ireland (2.1576), UK 
(1.8137), Turkey (1.6631) and Russia (1.0668). Lowest 
BT consumptions are observed in South Korea (0.0007), 
Brazil (0.001) and China (0.0011), as the Chinese popu- 
lation drinks 30 times more green tea (0.036 kg per 
inhabitant) than BT. Epidemiological data were derived 



Figure 1 2009 Black tea 
consumption data in kg/year per 
inhabitant (source: Euromonitor). 
Table 1 Sample of the dataset presenting the five key health indicators (rate per 100 000 inhabitants) and tea consumption in eight countries (kg per 100 000 inhabitants)

Country        Respiratory   Infectious diseases      Cancers       Cardiovascular   Diabetes      Black tea
               diseases      (tuberculosis and HIV)                 diseases                       consumption
Indonesia      2063          306                      776           1063             5639          30710
Romania        [illegible]   [illegible]              2361          3399             6772          590
Russia         2394          748                      [illegible]   [illegible]      [illegible]   106680
Hungary        2505          62                       2204          4685             5927          11270
Ukraine        2552          857                      2245          4630             4612          32290
Turkey         2931          48                       1271          1579             3326          166310
Egypt          3121          40                       615           1316             3979          95910
Saudi Arabia   3221          54                       353           914              4257          57020


from a specific extraction from the World Health Survey 
(WHS) conducted by the WHO. Each year, the WHS 
compiles comprehensive baseline information on the 
health of populations and health system outcomes. 29 
Using the 2009 dataset (sample presented in table 1), 
five key health indicators were selected in 50 countries 
in both men and women for all age groups: prevalence 
of respiratory diseases, prevalence of infectious diseases 
(tuberculosis and HIV), prevalence of cancer, preva- 
lence of cardiovascular diseases and prevalence of 
diabetes. 
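
Because table 1 expresses BT consumption per 100 000 inhabitants while the text quotes it per inhabitant, the two scales differ by a constant factor. The following minimal sketch (the function name `per_100k` is ours) makes the conversion explicit for the four highest consumers quoted above.

```python
# Per-capita consumption in kg/year per inhabitant, as quoted in the text.
PER_CAPITA_KG = {
    "Ireland": 2.1576,
    "UK": 1.8137,
    "Turkey": 1.6631,
    "Russia": 1.0668,
}


def per_100k(kg_per_capita: float) -> int:
    """Convert kg per inhabitant to kg per 100 000 inhabitants."""
    return round(kg_per_capita * 100_000)


for country, kg in PER_CAPITA_KG.items():
    print(f"{country}: {per_100k(kg)} kg per 100 000 inhabitants")
```

The value obtained for Turkey matches the figure listed for that country in table 1, which suggests that the same scaling was applied to the consumption column.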

Methods 

This ecological study used a data mining approach structured
in three phases: first, a 'calibrated principal component
analysis' (PCA) was used to reduce the database composed of six
variables (BT consumption and the five health indicators) to
three synthetic dimensions, represented by three axes, which can
be considered as the mathematical projection of the six
dimensions defined by the six variables into three dimensions;
second, the six variables were represented as vectors in a
'correlation circle' to study potential positive or negative
correlations; finally, a linear correlation model was tested on
selected variables.

Normative PCA 

PCA is a mathematical procedure that uses mathematical
projections to convert a set of n possibly correlated variables
representing n dimensions into a smaller number of dimensions
called 'principal components', classically represented on two or
three axes F1, F2 and F3. The projections use orthogonal
transformations defined in such a way that the first principal
component (first axis) has the highest possible variance in
order to synthesise most of the initial information. The main
objective of PCA is to reduce the dimensionality of the dataset.
PCA is often presented as a technique of factor analysis for
quantitative variables. Multiple correspondence analysis is
another type of factor analysis for quantitative, qualitative
and categorical variables and is useful to conduct multicriteria
analyses such as multicriteria risk assessment. 30 A 'normative
PCA' was selected for our study, as the six variables (BT
consumption per capita and five key health indicators) are
quantitative variables and this analysis was calibrated to study
potential correlations.
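
As an illustration of the procedure, the sketch below runs a PCA on standardised variables with NumPy. It assumes that 'normative PCA' denotes a PCA computed on the correlation matrix, which is our reading rather than something the text specifies, and it uses synthetic random data in place of the study dataset. The function name `normalised_pca` is ours.

```python
import numpy as np


def normalised_pca(data: np.ndarray):
    """PCA on standardised columns, i.e. on the correlation matrix.

    data has shape (n_observations, n_variables). Returns the eigenvalues
    in descending order, the matching eigenvectors (as columns) and the
    share of total variance carried by each principal axis.
    """
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    corr = (z.T @ z) / len(z)               # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvals, eigvecs, explained


# Synthetic stand-in for a 50 x 6 country table (NOT the study data).
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 6))
# Make the last two columns negatively correlated, for illustration only.
data[:, 5] = -0.6 * data[:, 4] + 0.8 * data[:, 5]

eigvals, eigvecs, explained = normalised_pca(data)
cumulative = np.cumsum(explained)   # 'quality' of the projection on F1, F1-F2, ...
contributions = eigvecs ** 2        # share of each variable in each axis

print("variance retained by F1-F3:", round(float(cumulative[2]), 3))
print("contributions of each variable to F3:", np.round(contributions[:, 2], 3))
```

In this formulation, the cumulative share of variance plays the role of the 'quality' of a two- or three-dimensional projection, and each column of `contributions` sums to 1, so that an entry can be read as the percentage contribution of a variable to the construction of an axis.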

Correlation circle 

The correlation circle shows a projection of the initial
variables in a dimensional space which can be represented in two
or three dimensions. 31 Variables are presented as vectors from
the centre. When two vectors are close to the correlation
circle, they can be: (1) close to each other, meaning a positive
correlation; (2) orthogonal to each other, meaning that they are
not correlated; or (3) on opposite sides of the centre, meaning
a significant negative correlation. When some vectors are close
to the centre, this means that some information is carried on
other axes, and that any interpretation might be hazardous. The
correlation circle is then used to identify the potential
proximity between the six vectors and to assess their potential
correlations. Should a vector representing the variable 'BT
consumption' be close to the correlation circle and point in a
similar direction to any of the other five vectors representing
health indicators, this would indicate a positive correlation
between the two variables.
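
The reading rules above can be sketched as a small function. The coordinates and the two thresholds (`min_length`, `tol_deg`) are hypothetical values chosen for illustration; they are not taken from the study.

```python
import math


def describe_pair(u, v, min_length=0.7, tol_deg=30.0):
    """Classify the relation between two variable vectors on a correlation circle.

    u and v hold the coordinates of each variable on the retained axes
    (its correlations with those axes). min_length and tol_deg are
    illustrative thresholds, not values used by the authors.
    """
    len_u = math.sqrt(sum(a * a for a in u))
    len_v = math.sqrt(sum(b * b for b in v))
    if min(len_u, len_v) < min_length:
        return "poorly represented: do not interpret"
    cos_angle = sum(a * b for a, b in zip(u, v)) / (len_u * len_v)
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
    if angle <= tol_deg:
        return "positive correlation"
    if angle >= 180.0 - tol_deg:
        return "negative correlation"
    if abs(angle - 90.0) <= tol_deg:
        return "little or no correlation"
    return "intermediate: weak relation"


# Hypothetical coordinates on two axes.
tea = (0.10, 0.90)
diabetes = (-0.15, -0.85)
cancer = (0.88, 0.05)
infectious = (0.20, 0.30)   # short vector, close to the centre

print(describe_pair(tea, diabetes))    # negative correlation
print(describe_pair(tea, cancer))      # little or no correlation
print(describe_pair(tea, infectious))  # poorly represented: do not interpret
```

The length check comes first because the angle between two vectors is only informative when both lie close to the circle.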

Linear correlation model 

Once identified by the correlation circle, potential
correlations between BT consumption and one or more health
indicators can be described using key statistical parameters,
such as the coefficient r² and the statistical significance p.
Using a linear correlation model between BT consumption and one
health indicator then determines the extent to which the values
of these two variables are potentially 'proportional' to each
other (BT consumption increases or decreases with one specific
disease prevalence). The linear model formula is: y=ax+b (y,
health indicator; x, BT consumption; a and b are the model
calculated coefficients). The structure of this formula suggests
that the variables x and y are linearly related and thus
proportional; that is, the correlation is high if it can be
represented by a straight line (upwards or downwards slope). If
so, this line will represent the linear model, also called a
'regression line' or 'least squares line' because the sum of the
squared distances of all the data points from the line is the
lowest possible. The coefficient r² (coefficient of
determination) represents the proportion of common variations
between the two variables and establishes the 'strength' of the
relationship. In order to evaluate the potential correlation
between BT consumption and one specific health indicator, it is
therefore important to know r², the statistical significance p
of the correlation (calculated by a Fisher-Snedecor test) and
the statistical significance of the difference to 0 of the
coefficient 'a' (Student's t test).
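
The quantities named above can be computed directly from the least-squares formulas. The sketch below fits y=ax+b to the six fully legible rows of table 1 only, so its output is an illustration and does not reproduce the 42-country model; the function name `fit_line` is ours.

```python
import math

# (BT consumption in kg per 100 000, diabetes cases per 100 000) for six rows
# of table 1; a small sample, not the 42-country dataset used in the study.
SAMPLE = {
    "Indonesia": (30710, 5639),
    "Hungary": (11270, 5927),
    "Ukraine": (32290, 4612),
    "Turkey": (166310, 3326),
    "Egypt": (95910, 3979),
    "Saudi Arabia": (57020, 4257),
}


def fit_line(x, y):
    """Least-squares fit of y = a*x + b with r2, Student's t and Fisher's F."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    df = n - 2                                  # residual degrees of freedom
    r2 = 1.0 - ss_res / syy
    t_stat = a / math.sqrt(ss_res / df / sxx)   # tests whether a differs from 0
    f_stat = (syy - ss_res) / (ss_res / df)     # tests the model as a whole
    return {"a": a, "b": b, "r2": r2, "t": t_stat, "F": f_stat, "df": df}


x = [bt for bt, _ in SAMPLE.values()]
y = [diabetes for _, diabetes in SAMPLE.values()]
fit = fit_line(x, y)
print({key: round(value, 4) for key, value in fit.items()})
```

With a single explanatory variable, the Fisher-Snedecor statistic equals the square of Student's t statistic for the slope, so the two tests necessarily return the same p value.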



RESULTS 

The database was composed of 300 fields representing six
variables (five health indicators and BT consumption) in 50
countries. Using normative PCA on this multidimensional table,
the overall 'quality' (percentage of original variance) of the
final projection from six dimensions (six variables) was 59% in
two dimensions and 74% when projected in three dimensions. This
confirms that the best representation of the dataset should be
in three dimensions, which can be described by axes entitled F1,
F2 and F3. The 'BT consumption' variable provided a high
contribution to the construction of axis F3 (81%). The angle of
the vector 'BT consumption' with axis F3 was only 22°,
confirming the very high contribution of this variable to axis
F3. Forty-two of the 50 countries were related to this axis F3.
The eight countries not contributing to F3 were Brazil, China,
Venezuela, Morocco, Colombia, Vietnam, Philippines and Israel,
suggesting the absence of any correlation between BT consumption
and health indicators in these particular countries.

The 'correlation circle' (figure 2) shows that the 'BT 
consumption' vector was strictly opposite the 'Diabetes 
prevalence' vector, establishing a strong statistical nega- 
tive correlation. Vectors concerning the other key health 



Figure 2 Three-dimensional correlation circle of five health indicators and black tea (BT) consumption. In this three-dimensional 
representation, the 'infectious disease' vector seems to be close to the BT vector, but is actually represented by a large angle in the 
third dimension, confirming the poor meaningful correlations between the 'infectious diseases' and the 'BT consumption' variables. 
Figure 3 Linear correlation model between black tea
consumption (kg per 100 000 inhabitants) and diabetes 
prevalence (cases per 100 000). 



indicators (infectious diseases, respiratory diseases, 
cancer and cardiovascular diseases) were represented 
with a large angle (close to orthogonal) compared with 
the BT vector, confirming poor statistical relationships 
between BT and these four health indicators. Of particular
interest was the interpretation of the 'infectious
disease' vector, which seemed to be close to the BT
vector in a two-dimensional projection, but was actually
represented by a large angle in the third dimension.
The infectious disease vector was also closer to the 
centre of the correlation circle, confirming the poor 
meaningful correlations and potentially hazardous inter- 
pretations. Consequently, among the five health indica- 
tors selected, only the diabetes parameter was correlated 
with BT consumption and can be submitted to discus- 
sion and interpretation. No valid interpretations can be 
derived from the other four health indicators using this 
dataset. 

The linear correlation model with the format y=ax+b is
represented by the following formula and is presented in
figure 3:

Diabetes prevalence = -0.0171183 × BT consumption + 6173.64

The y-coordinate of the point at which the regression line
intersects the y-axis (intercept) corresponds to the diabetes
prevalence predicted by the model for a country in which BT
consumption is zero (6173 cases per 100 000 inhabitants). Based
on 42 countries, the p value of the Fisher-Snedecor test was
0.003, which is highly significant, confirming the relevance of
the linear model. The coefficient r² was equal to 0.199.
Student's t test confirmed that the slope coefficient
(-0.0171183) was significantly different from 0 (p=0.003) with a
95% CI of -0.028 to -0.006. The negativity of coefficient 'a'
indicates that diabetes prevalence decreases as BT consumption
increases, confirming a negative correlation.
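
The reported figures can be cross-checked against one another. The sketch below starts from the published slope, intercept, r² and number of countries; the critical value 2.021 is the tabulated two-sided 95% Student quantile for 40 degrees of freedom, and everything else is derived. It is a consistency check under these inputs, not a re-analysis of the data.

```python
import math

SLOPE = -0.0171183      # reported coefficient a
INTERCEPT = 6173.64     # reported coefficient b
R_SQUARED = 0.199       # reported coefficient of determination
N_COUNTRIES = 42        # countries retained in the model
T_CRIT_95 = 2.021       # tabulated two-sided 95% Student quantile, 40 df


def predicted_prevalence(bt_kg_per_100k: float) -> float:
    """Diabetes cases per 100 000 predicted by the reported line."""
    return SLOPE * bt_kg_per_100k + INTERCEPT


df = N_COUNTRIES - 2
f_stat = R_SQUARED * df / (1.0 - R_SQUARED)   # Fisher-Snedecor statistic
t_stat = -math.sqrt(f_stat)                   # same sign as the slope
std_err = SLOPE / t_stat                      # standard error of the slope
ci_low = SLOPE - T_CRIT_95 * std_err
ci_high = SLOPE + T_CRIT_95 * std_err

turkey = predicted_prevalence(166_310)        # Turkey: 1.6631 kg per inhabitant

print(f"F = {f_stat:.2f}, t = {t_stat:.2f}")
print(f"95% CI for the slope: {ci_low:.3f} to {ci_high:.3f}")
print(f"predicted prevalence for Turkey: {turkey:.0f}")
```

The derived interval rounds to -0.028 and -0.006, and a t statistic of about -3.15 with 40 degrees of freedom is compatible with a p value close to 0.003. The prediction for Turkey, about 3327 cases per 100 000 inhabitants, is close to the value listed for that country in table 1.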



DISCUSSION 
Limitations 

This study establishes an inverse linear statistical relation- 
ship between high BT consumption and diabetes preva- 
lence in the world, and confirms the findings of the 
European ecological study establishing a similar relation- 
ship. 32 As in any database analysis, the very first limita- 
tion of this study is related to the quality of the data. 
WHO prevalence data were obtained from the WHS,
which constitutes a convenient and official source of key
morbidity indicators around the world. The general
design of the WHS is based on population sampling 
organised in the 192 Member States of the United 
Nations using face-to-face or telephone interviews. As 
the survey questionnaire offers a menu of choices of 
modules for each country, and lets the country select 
the survey approach (Household face-to-face survey, 
Computer-Assisted Telephone Interview or Computer- 
Assisted Personal Interview), the quality of data collec- 
tion can be expected to be heterogeneous around 
the world. 

Furthermore, some of the selected health indicators 
represent a group of diseases, such as infectious diseases 
(tuberculosis and HIV) and cancer. The heterogeneity 
of these indicators can make it difficult to establish any 
potential statistical relationships. Although more homo- 
geneous, health indicators such as diabetes depend on 
diagnostic criteria, which can vary across countries. On 
the other hand, any fixed survey design with fixed cri- 
teria would not be appropriate everywhere, for example 
in countries with low telephone network coverage when 
planning telephone interviews. Other approaches to
estimating the prevalence of diabetes in the world have
relied on the literature and data extrapolations, 33
confirming the growing burden of diabetes. Another
important concern is the interpretation of the estab- 
lished statistical relationship between BT consumption 
and diabetes prevalence. Using a systematic data mining 
approach, we tested the potential statistical relationship 
between BT consumption and five health indicators, 
without any a priori assumptions in relation to any of 
these health indicators. We observed that, among the 
five health indicators, only the 'prevalence of diabetes' 
indicator appeared to have a strong statistical relation- 
ship with BT consumption. This ecological approach 
considers the population as the unit of analysis rather 
than an individual, which is considered to be inferior to 
case-control studies in the context of evidence-based 
medicine. In an ecological study, no information is avail- 
able about the individual members of the populations 
compared, whereas in a case-control study, information
is reported for each individual. A classical criticism of 
this approach is the 'ecological fallacy', corresponding 
to a logical fallacy in interpretation of the observed cor- 
relations at the population level, assuming that they can 
be applied at the individual level. It is well-known that 
statistics that accurately describe group characteristics 
do not necessarily apply to individuals within that group. 
Our study on BT does not comprise any potential
logical fallacy, as it was not used as the basis for any indi- 
vidual assumptions. However, when interesting and 
strong associations are observed, the results of ecological 
studies have provided numerous assumptions that have 
been subsequently confirmed by experimental studies. 
One of the best-known studies was that published by 
Keys in 1980 34 concerning the relationship between dietary
habits and coronary heart disease in seven countries. 
The results of what later came to be known as the 
'Seven Countries Study' appeared to show that serum 
cholesterol was strongly related to coronary heart disease 
mortality at both the population and individual levels, 
leading to US government dietetic guidelines. Other 
ecological studies have significantly contributed to scien- 
tific knowledge and public health interventions, such as 
the relationship between lung cancer and tobacco, 
which has been confirmed by numerous studies. 35 For 
these reasons, ecological studies can be very useful for 
international comparisons, while case-control studies 
are exclusively based on local information. Furthermore, 
when strong correlations have been established, the 
results of ecological studies can suggest further evidence- 
based studies, investigating the relevance and mechan- 
ism of the statistical relationship. 
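
The ecological fallacy discussed above can be illustrated with a deliberately artificial example. In the sketch below, three hypothetical populations of three individuals each are constructed so that the association between population means is the reverse of the association between individuals within each population; none of the numbers relate to the study data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical populations: (exposure, outcome) values for three individuals.
populations = {
    "A": ([1, 2, 3], [10, 11, 12]),
    "B": ([4, 5, 6], [7, 8, 9]),
    "C": ([7, 8, 9], [4, 5, 6]),
}

# Ecological level: one point per population (the group means).
mean_exposure = [sum(x) / len(x) for x, _ in populations.values()]
mean_outcome = [sum(y) / len(y) for _, y in populations.values()]
between = pearson(mean_exposure, mean_outcome)

# Individual level: correlation inside each population.
within = {name: pearson(x, y) for name, (x, y) in populations.items()}

print("between populations:", round(between, 2))
print("within populations:", {k: round(v, 2) for k, v in within.items()})
```

Here the correlation between population means is -1 while the correlation within every population is +1, which is why a population-level result cannot, on its own, be transferred to individuals.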

Growing interest in food components that may support
weight management and glucose metabolism 

Our results confirm the recent 2012 publication from the 
InterAct Consortium which carried out a European eco- 
logical study confirming an inverse linear association 
between tea consumption and the incidence of type 2 
diabetes in Europe. 32 The type 2 diabetes epidemic in 
many countries has stimulated interest in food compo- 
nents that may support weight management. An almost 
six-fold increase in the number of people with diabetes 
has been observed over the last few decades. The 
International Diabetes Federation reports that the 
number of people with diabetes will escalate from 285 to 
438 million between 2010 and 2030 33 and the number of 
persons with impaired glucose tolerance (IGT) will increase from 344 to 472 million.
By 2030, there will be over 900 million people worldwide 
with diabetes or at high risk of diabetes. Diabetes confers 
about a two-fold excess risk for a wide range of vascular 
diseases. 36 Furthermore, diabetic retinopathy is a 
common and specific microvascular complication of dia- 
betes, and remains the leading cause of preventable 
blindness in working-aged people. 37 With one of the
highest prevalences of all human diseases, diabetes is 
now a global epidemic with devastating health, social and 
economic consequences. 38 In certain ethnic groups, such 
as Asian populations, diabetes develops at a younger age 
than in Caucasian populations. Several distinctive fea- 
tures are apparent in the pathogenic factors for diabetes 
and their thresholds in Asian populations. 39 In conjunc- 
tion with genetic susceptibility, type 2 diabetes is brought 
on by environmental and behavioural factors such as a 
sedentary lifestyle, overly rich nutrition and obesity and 



results in a huge economic burden. According to WHS 
2009 data, Singapore is the country with the highest 
diabetes prevalence with 12 876 cases per 100 000 inhabi- 
tants, which is mainly observed in the Chinese commu- 
nity and is probably due to the intense urban lifestyle 
in Singapore. 41 Although many laboratory studies have 
observed physiological effects of BT on glucose meta- 
bolism, 17 18 42 43 the underlying mechanisms remain 
unclear. The results of human intervention studies are 
mixed 44 and the role of caffeine has been suggested but 
not clearly established. 43 45 Neyestani et al 2 found that
regular daily intake of BT improves oxidative stress bio- 
markers and decreases serum C reactive protein levels in 
type 2 diabetic patients. Histological studies on pancreas 
cells published by Manikandan et al 17 concluded that 
the BT extract contributes to regeneration of damaged 
pancreas cells and protects pancreatic β-cells by its
antioxidant action. Nonetheless, the role of environment,
dietary and lifestyle practices is fundamental when com- 
paring health indicators around the world. Psaltopoulou 
et al confirmed that low-glycaemic index dietary patterns
reduced both fasting blood glucose and glycated
proteins independently of carbohydrate consumption. 
Diets rich in whole-grain, cereal high-fibre products and 
non-oil-seed pulses would also be beneficial. As vitamins 
and minerals play an important role in glucose metabol- 
ism, understanding the impact of potential vitamin and 
mineral deficiencies across cultures is also relevant to 
better organisation of prevention and management of 
type 2 diabetes. 47 48 An observational study based on 
nearly 37 000 middle-aged Chinese reported a 14% 
reduction in the risk of developing type 2 diabetes by 
drinking one or more cups of tea per day. 49 This was
confirmed by two meta-analyses published by Huxley et al 3
and Jing et al. 50 Flavonoids are believed to support 
normal glucose metabolism via anti-inflammatory effects 
and increased insulin activity. 51 52 Various studies, espe- 
cially in Asian populations, confirm that flavonoids 
present in green tea could reduce fat absorption in the 
gut, may promote fat oxidation in tissues and may 
increase energy expenditure. 53 An observational study of 
4300 Dutch adults found that flavonoid intake was 
highest in women who gained the least weight over a 
14-year period. 54 Furthermore, as physical activity with or 
without diet contributes to a healthier lifestyle, this 
important factor must be considered when comparing 
health indicators between the industrialised and emer- 
ging countries. Given rapid population growth, increased 
urbanisation, and adverse lifestyle changes, the obesity/ 
type 2 diabetes epidemic in resource-poor nations was 
predicted in the 1990s and has now been fully
confirmed, 55 underlining the importance of a better
understanding of predictive and potentially protective factors.

Correlation and causality 

The number of factors contributing to the growth of dia- 
betes and obesity in the world confirms that 'correlation 
does not imply causality', and that a significant linear 
correlation between BT consumption and diabetes
prevalence does not imply that low BT consumption 
could cause diabetes. If one factor is established as 
causing another, then the two factors are most certainly 
correlated. However, the opposite cannot be concluded. 
Thus, a correlation can only indicate a potential direct
or indirect cause, which then needs to be
further investigated. This paradigm and the connota- 
tions of causality may be the most important considera- 
tions affecting biostatistics not only in ecological studies 
but also in major epidemiological study designs. 56 A well-known
example of epidemiological cause-and-effect misinterpretation
is the correlation that was established
between hormone replacement therapy and a lower
incidence of coronary heart disease. This association has
been more recently explained by the fact that women 
taking hormone replacement therapy were more likely 
to come from higher socioeconomic levels, which could 
explain the lower incidence of coronary heart disease. 57 
Establishing causality is one of the most difficult chal- 
lenges in public health. For instance, in clinical 
research, randomised controlled clinical trials are per- 
formed to establish potential significant differences 
between two groups. However, establishing a difference 
is not a demonstration of causality. Another example is 
case-control studies, which compare individuals with a 
specific disease ('cases') with a group of individuals 
without the disease ('controls'). An association between 
the hypothesised exposure and the disease studied 
would be reflected by a higher proportion of exposed individuals
among cases, but this cannot constitute a real demonstration of
causality. A potential causality can only be established 
with the convergence of interdisciplinary scientific evi- 
dence (biological, physiological, epidemiological, etc) 
and reasonable explanations based on longitudinal 
studies. In any case, ecological research can address 
important issues that cannot be easily addressed by 
other study designs. Ecological studies are frequently 
used when alternative study designs are not possible (eg, 
randomised controlled trials), such as when investigating
the effect of geographical factors on disease incidence. 
Our research, like all ecological studies and most other 
epidemiological approaches, presents a number of lim- 
itations because factors other than dietary habits may be 
the most important determinants of variations in dia- 
betes prevalence across communities. For example, it is 
possible that other unmeasured confounding factors 
(eg, genetic differences) may explain some of the 
observed regional variations. Due to the large number 
of potential determinants of diabetes prevalence, includ- 
ing patient-related, physician-related, hospital-related 
and community-related variables, it is difficult to identify 
with certainty all of the causes of the regional variations 
of diabetes prevalence, and additional follow-up studies 
should be considered to confirm the hypotheses gener- 
ated by this type of study. Despite the fact that a number 
of biological, physiological and epidemiological field 
studies have provided evidence linking BT consumption
and glucose metabolism, 16 17 22 42-44 51 52 a large-scale
randomised controlled trial of tea consumption and dia- 
betes risk would be useful to confirm these findings. 
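
The confounding mechanism described above for hormone replacement therapy can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not an analysis of the study data: the variables `confounder`, `exposure` and `outcome`, the sample size and the unit effect sizes are all hypothetical. By construction, the exposure has no direct effect on the outcome, yet the two are correlated because both depend on the confounder.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def residuals(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return [b - mean_y - slope * (a - mean_x) for a, b in zip(x, y)]


def simulate(n=5000, seed=0):
    """Hypothetical model: the confounder drives both other variables.

    The exposure never enters the formula for the outcome, so any
    correlation between them is due to the shared confounder alone.
    """
    rng = random.Random(seed)
    confounder = [rng.gauss(0, 1) for _ in range(n)]
    exposure = [c + rng.gauss(0, 1) for c in confounder]
    outcome = [-c + rng.gauss(0, 1) for c in confounder]
    return confounder, exposure, outcome


confounder, exposure, outcome = simulate()

# Crude association between exposure and outcome.
raw_r = pearson(exposure, outcome)

# Association after removing the linear effect of the confounder
# from both variables (a partial correlation).
adjusted_r = pearson(residuals(exposure, confounder),
                     residuals(outcome, confounder))

print(f"raw correlation: {raw_r:.2f}")
print(f"correlation after adjusting for the confounder: {adjusted_r:.2f}")
```

In this constructed model, the crude correlation is clearly negative, whereas the adjusted correlation is close to zero. In an ecological dataset the relevant confounders are usually unmeasured, so this adjustment is often not available, which is why such correlations can only generate hypotheses.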

Data mining and data dredging 

Beyond the causality issue, a frequent criticism of
data mining stems from confusion between data
mining and data dredging techniques. While a data
mining approach is based on searching for combinations 
of variables that might show potential correlations, data- 
dredging (also called 'data fishing') can generate mis- 
leading results. 58 When a number of hypotheses are 
tested, it is expected that some will falsely appear to be 
statistically significant, since every database can contain 
potential random correlations. A robust data mining 
approach must therefore always be based on a clear 
research strategy and a limited number of relevant 
meaningful assumptions. In our assessment, we used a 
systematic data mining approach to test potential corre- 
lations between six selected variables (BT consumption 
and five key health indicators). PCA was used to describe
and structure the dataset before testing any correlations. 
In our study, only one linear correlation model was con- 
structed between BT consumption and diabetes preva- 
lence, based on the most relevant association suggested 
by the PCA. This consistent approach is quite different 
from screening numerous cross-regression analyses 
between all variables of one particular dataset. The data 
mining approach can be considered to be a 'radar track- 
ing system', allowing detection, tracking and classifica- 
tion of potential 'targets' in the framework of a 
particular environment. This is particularly useful when 
exploring complex databases, as data mining can iden- 
tify original statistical evidence, which would never be 
discovered by means of classical statistical techniques. 
As an example, the significant progress in genomics 
would not have been possible without the use of data 
mining techniques. Despite the data collection homogen- 
eity issue inherent to large cross-country comparisons, we 
believe that this multidimensional approach can provide 
valuable additional scientific information, completing 
published biological, physiological and epidemiological 
studies conducted on tea consumption, diabetes and 
obesity. These results should support further research into
a possible causal effect of BT consumption
on type 2 diabetes prevalence in the world.
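
The multiple-testing concern raised above can be made concrete with a short calculation. The sketch below is illustrative only and rests on two assumptions that are not taken from the study: a significance level of 0.05 and statistically independent tests (pairwise correlations computed on the same dataset are not strictly independent). The variable names are hypothetical labels for the six variables studied.

```python
from itertools import combinations


def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests, each performed at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests


# Hypothetical labels for the six variables considered in the study.
variables = ["black_tea", "respiratory", "infectious",
             "cancer", "cardiovascular", "diabetes"]

# Number of distinct pairwise correlations if every pair were screened.
n_pairs = len(list(combinations(variables, 2)))

print(f"pairwise tests if all pairs are screened: {n_pairs}")
print(f"chance of a spurious finding, all pairs: "
      f"{family_wise_error_rate(n_pairs):.2f}")
print(f"chance of a spurious finding, single test: "
      f"{family_wise_error_rate(1):.2f}")
```

Under these assumptions, screening all 15 pairs of six variables would give a chance of roughly 54% of at least one spurious 'significant' correlation, compared with 5% for a single pre-specified test.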